1 Introduction

As mentioned before, the data contain many missing values that need to be imputed. Since kNN imputation gave the best results in the previous part, we use it here as well.

2 Cluster analysis

There are 2 classes in the initial data, but we will not use this information in clustering; instead, we will try to determine the optimal number of clusters for each method. Moreover, since the data are mixed, both numerical and categorical, we use Gower's dissimilarity measure to calculate the dissimilarity matrix.
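Gower's measure averages per-variable dissimilarities: for a numeric variable, the absolute difference divided by the variable's range; for a categorical variable, 0 on a match and 1 on a mismatch. A minimal base-R sketch on a hypothetical toy data frame (the column names are illustrative, not from our data):

```r
# Gower dissimilarity between two rows of a small mixed data frame.
# Numeric columns: |x_i - x_j| / range; categorical columns: 0 if equal, 1 otherwise.
toy <- data.frame(
  age    = c(34, 58, 41),                # numeric
  smoker = factor(c("no", "yes", "no"))  # categorical
)

gower_pair <- function(df, i, j) {
  d <- numeric(ncol(df))
  for (k in seq_along(df)) {
    col <- df[[k]]
    if (is.numeric(col)) {
      d[k] <- abs(col[i] - col[j]) / diff(range(col))
    } else {
      d[k] <- as.numeric(col[i] != col[j])
    }
  }
  mean(d)  # equal weights for all variables
}

gower_pair(toy, 1, 3)  # small age gap, same smoker level -> small dissimilarity
```

This is what `daisy(..., metric = "gower")` computes for every pair of rows (plus proper handling of weights, missing values, and binary types).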

df.features <- df1.standarized[, 2:20]
df.class.num <- df1$Class
df.class <- as.factor(df1$Class)
dm <- daisy(df.features, metric = "gower")
## Warning in daisy(df.features, metric = "gower"): binary variable(s) 2, 3, 4, 19
## treated as interval scaled
dm.mat <- as.matrix(dm)
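The warning above can be silenced by telling `daisy` explicitly which columns are binary via its `type` argument. A sketch on toy data (for our data the analogous call would list columns 2, 3, 4, and 19; whether they should be symmetric or asymmetric binary depends on their coding, which is an assumption here):

```r
library(cluster)

# Toy stand-in for df.features: one numeric and one 0/1 column.
toy <- data.frame(x = c(1.2, 3.4, 2.0),
                  b = c(0, 1, 1))

# Declaring column 2 as symmetric binary avoids the
# "treated as interval scaled" warning; use "asymm" instead when a
# shared 0 carries no information about similarity.
dm_toy <- daisy(toy, metric = "gower", type = list(symm = 2))
```

For 0/1-coded variables the interval-scaled fallback happens to give the same pairwise values, so the warning is harmless here, but declaring the types makes the intent explicit.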

Visualization of the dissimilarity matrix after ordering.

3 Dimension reduction method

As we have both numerical and categorical attributes, we cannot use PCA (Principal Component Analysis). Instead, we use MDS (Multidimensional Scaling).

3.1 MDS (M)

We use standardized data.

Let us look at the scree plot.

The scree plot shows a clear elbow at dimension = 2, which suggests that a 2D solution should be adequate. Now we check out the Shepard diagram:

The plot for d = 2 shows little spread around the fitted function, which also indicates a good fit of the 2D solution.

So, we will use the 2-dimensional MDS solution. The classes are separated but also overlap, so classification may be difficult later.
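Classical scaling in base R illustrates both properties discussed above: the configuration is column-centred (hence zero column means), and a Shepard-style check compares the original dissimilarities against distances in the 2-D configuration. A sketch on synthetic data (the report may have used a different MDS variant):

```r
set.seed(1)

# Synthetic dissimilarities from random 2-D points.
pts <- matrix(rnorm(40), ncol = 2)
d   <- dist(pts)

# Classical (metric) MDS into 2 dimensions; cmdscale centres each axis.
conf <- cmdscale(d, k = 2)
colMeans(conf)  # ~ 0 on both axes

# Shepard-style check: original dissimilarities vs configuration distances.
cor(as.vector(d), as.vector(dist(conf)))  # close to 1 for a good fit
```

Since these synthetic points really live in 2 dimensions, the configuration reproduces the distances almost exactly; for real data the correlation drops as information is lost.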

To sum up, we now have data with two numerical features and a target attribute; both columns have mean 0.

summary(df.mds)
##        X1                 X2            Class  
##  Min.   :-0.29530   Min.   :-0.325977   0:123  
##  1st Qu.:-0.15952   1st Qu.:-0.088834   1: 32  
##  Median :-0.03263   Median :-0.008753          
##  Mean   : 0.00000   Mean   : 0.000000          
##  3rd Qu.: 0.13709   3rd Qu.: 0.093633          
##  Max.   : 0.39117   Max.   : 0.286861

3.2 Classification (I)

3.3 Clustering (M)


As mentioned before, we now have two numerical features, so we use the Euclidean distance for clustering. We standardized the data.

df.features.mds <- df.mds[-3]
df.features.mds <- scale(df.features.mds)
dm.mds <- daisy(df.features.mds)
dm.mat.mds <- as.matrix(dm.mds)

Visualization of the dissimilarity matrix after ordering.

The dissimilarity matrix looks better than for the initial data: more values near the diagonal are blue (low dissimilarity).

3.3.1 K-means (M)

As we now have only numerical data, we do not need to split the dataset into numerical and categorical parts.

Since it is a partitioning cluster method, we first have to select the number of clusters.

So, the optimal number of clusters for the K-means method:

* Elbow method: it is difficult to determine the optimal number of clusters, because there is no clear bend in the curve, but a possible choice is 3;
* Silhouette method: the optimal number of clusters is 4;
* Gap statistic method: surprisingly, it suggests 1 cluster, although 3 clusters also looks reasonable.

Looking at these statistics alone, it is hard to say how many clusters the dataset contains, but 3 or 4 clusters are reasonable choices for K-means.
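The elbow statistic behind the first bullet is easy to reproduce in base R: run `kmeans` for a range of k and track the total within-cluster sum of squares, which should flatten past the "true" k. A sketch on synthetic data:

```r
set.seed(42)

# Three well-separated 2-D blobs.
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 4), ncol = 2),
           matrix(rnorm(60, mean = 8), ncol = 2))

# Total within-cluster sum of squares for k = 1..6;
# nstart restarts guard against poor local optima.
wss <- sapply(1:6, function(k)
  kmeans(x, centers = k, nstart = 25)$tot.withinss)

# plot(1:6, wss, type = "b")  # look for the bend ("elbow"), here at k = 3
```

The curve always decreases as k grows, which is exactly why the bend, rather than the minimum, is the signal.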

Furthermore, we can run NbClust, which computes up to 30 indices for determining the optimal number of clusters and then takes a majority vote among them.

## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 1 proposed  1 as the best number of clusters
## * 5 proposed  2 as the best number of clusters
## * 5 proposed  3 as the best number of clusters
## * 5 proposed  4 as the best number of clusters
## * 1 proposed  5 as the best number of clusters
## * 3 proposed  12 as the best number of clusters
## * 1 proposed  13 as the best number of clusters
## * 1 proposed  14 as the best number of clusters
## * 2 proposed  15 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  2 .

The vote is effectively a three-way tie: 5 indices each propose 2, 3, and 4 clusters, so we will try all three.

As we have only 2 attributes, we can use fviz_cluster to plot our clusters.

We can see that the clusters are well separated for every number of clusters.

The last part of the analysis is to validate the clusters found.
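The per-cluster tables below come from silhouette analysis; for any partition and dissimilarity matrix, the underlying computation looks roughly like this (using `cluster::silhouette` on synthetic data):

```r
library(cluster)
set.seed(7)

# Two separated blobs and a k-means partition of them.
x  <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
            matrix(rnorm(40, mean = 5), ncol = 2))
km <- kmeans(x, centers = 2, nstart = 10)

# Silhouette width per point: cohesion to own cluster (a_i)
# vs separation from the nearest other cluster (b_i).
sil <- silhouette(km$cluster, dist(x))
summary(sil)$clus.avg.widths  # average width per cluster
mean(sil[, "sil_width"])      # overall average silhouette width
```

Widths close to 1 mean tight, well-separated clusters; values near 0 mean points sit between clusters.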

##   cluster size ave.sil.width
## 1       1   93          0.42
## 2       2   62          0.32

##   cluster size ave.sil.width
## 1       1   45          0.35
## 2       2   38          0.33
## 3       3   72          0.50

##   cluster size ave.sil.width
## 1       1   34          0.38
## 2       2   28          0.40
## 3       3   28          0.34
## 4       4   65          0.48

We can see that, for every number of clusters, one cluster is larger than the others, and the bigger a cluster is, the higher its silhouette width. The average silhouette width is 0.38, 0.41, and 0.42 for 2, 3, and 4 clusters, respectively, so 4 clusters works best. The results are also better than for the initial data, but we compare them in more detail later.

Now, let us compare our clusters with the original classes for 2 clusters.

compareMatchedClasses(df.class.num, kmeans2$cluster, method="exact")$diag[1,1]
## [1] 0.7032258
rand.index(df.class.num, kmeans2$cluster)
## [1] 0.5798911
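The value above, `rand.index` (presumably from the fossil package), counts the fraction of point pairs on which the two partitions agree; it is simple to reproduce in base R, which makes the 0.58 easier to interpret:

```r
# Rand index: fraction of point pairs on which two partitions agree
# (both in the same cluster, or both in different clusters).
rand_index <- function(a, b) {
  n <- length(a)
  agree <- 0
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      same_a <- a[i] == a[j]
      same_b <- b[i] == b[j]
      if (same_a == same_b) agree <- agree + 1
    }
  }
  agree / choose(n, 2)
}

rand_index(c(1, 1, 2, 2), c(1, 1, 2, 2))  # identical partitions -> 1
rand_index(c(1, 1, 2, 2), c(2, 2, 1, 1))  # relabelled but same grouping -> 1
```

Note the index is label-invariant, so cluster numbers need not match class numbers for a high score.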

3.3.2 Partitioning Around Medoids (PAM) (M)

The next method is Partitioning Around Medoids (PAM). PAM can be used with mixed data and is less sensitive to outliers than K-means.

Since it is a partitioning cluster method, we first have to select the number of clusters.

So, the optimal number of clusters for the PAM method:

* Elbow method: it is hard to determine the optimal number of clusters, because there is no clear bend in the curve, but a possible choice is 4;
* Silhouette method: the optimal number of clusters is 4;
* Gap statistic method: it suggests 1 cluster, although 3 clusters also looks reasonable.

Looking at these statistics alone, it is hard to say how many clusters the dataset contains, but 3 and 4 clusters are reasonable choices for PAM; we will also try 2 clusters for comparison.
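PAM itself is available as `cluster::pam` and, unlike `kmeans`, accepts a precomputed dissimilarity matrix directly, which is what makes it usable for mixed data. A minimal sketch on synthetic data:

```r
library(cluster)
set.seed(3)

# Two blobs; pam works either on raw data or on a dissimilarity object.
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))

# Medoids are actual observations, which is why PAM resists outliers:
# a medoid cannot be dragged off the data like a mean can.
fit <- pam(dist(x), k = 2, diss = TRUE)
fit$id.med             # row indices of the two medoid points
table(fit$clustering)  # cluster sizes
```

With a Gower dissimilarity object in place of `dist(x)`, the same call clusters mixed data.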

As we have only 2 attributes, we can use fviz_cluster to plot our clusters.

We can see that the clusters are well separated for every number of clusters.

The last part of the analysis is to validate the clusters found.

##   cluster size ave.sil.width
## 1       1   87          0.48
## 2       2   68          0.24

##   cluster size ave.sil.width
## 1       1   69          0.50
## 2       2   39          0.42
## 3       3   47          0.27

##   cluster size ave.sil.width
## 1       1   61          0.50
## 2       2   37          0.35
## 3       3   29          0.32
## 4       4   28          0.41

We can see that for 2 clusters the cluster sizes are similar, while for 3 and 4 clusters one cluster is noticeably larger than the others. The average silhouette width is 0.37, 0.41, and 0.41 for 2, 3, and 4 clusters, respectively, so the 2-cluster solution works slightly worse. The results are again better than for the initial data.

Now, let us compare our clusters with the original classes for 2 clusters.

compareMatchedClasses(df.class.num, pam2$cluster, method="exact")$diag[1,1]
## [1] 0.7290323
rand.index(df.class.num, pam2$cluster)
## [1] 0.602346

3.4 AGNES (I)

During the previous clustering we saw that single linkage performs poorly, so we will not use it here.
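Besides eyeballing dendrograms, one way to compare linkages before committing is the agglomerative coefficient reported by `cluster::agnes` (closer to 1 suggests stronger clustering structure). A sketch on synthetic data:

```r
library(cluster)
set.seed(5)

# Two blobs and their distance matrix.
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 4), ncol = 2))
d <- dist(x)

# Agglomerative coefficient for each linkage under consideration.
acs <- sapply(c("average", "complete", "single"),
              function(m) agnes(d, diss = TRUE, method = m)$ac)
acs
```

The coefficient grows with sample size, so it is best used to compare linkages on the same data, not across datasets.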

all.complete <- NbClust(data = df.features.mds, diss = dm.mds, distance = NULL, min.nc=2, max.nc=10, method="complete", index="all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 3 proposed 2 as the best number of clusters 
## * 7 proposed 3 as the best number of clusters 
## * 7 proposed 4 as the best number of clusters 
## * 1 proposed 5 as the best number of clusters 
## * 2 proposed 6 as the best number of clusters 
## * 1 proposed 9 as the best number of clusters 
## * 2 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
all.avg <- NbClust(data = df.features.mds, diss = dm.mds, distance = NULL, min.nc=2, max.nc=10, method="average", index="all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 4 proposed 2 as the best number of clusters 
## * 9 proposed 3 as the best number of clusters 
## * 3 proposed 4 as the best number of clusters 
## * 1 proposed 6 as the best number of clusters 
## * 1 proposed 7 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## * 3 proposed 9 as the best number of clusters 
## * 1 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 1 proposed  1 as the best number of clusters
## * 3 proposed  2 as the best number of clusters
## * 7 proposed  3 as the best number of clusters
## * 7 proposed  4 as the best number of clusters
## * 1 proposed  5 as the best number of clusters
## * 2 proposed  6 as the best number of clusters
## * 1 proposed  9 as the best number of clusters
## * 2 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  3 .

## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 1 proposed  1 as the best number of clusters
## * 4 proposed  2 as the best number of clusters
## * 9 proposed  3 as the best number of clusters
## * 3 proposed  4 as the best number of clusters
## * 1 proposed  6 as the best number of clusters
## * 1 proposed  7 as the best number of clusters
## * 1 proposed  8 as the best number of clusters
## * 3 proposed  9 as the best number of clusters
## * 1 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  3 .

For complete linkage we will use 3 and 4 clusters, and for average linkage only 3 clusters. In addition, we will use 2 clusters in order to compare it with previous results.

agnes.avg      <- agnes(x=dm.mat.mds, diss=TRUE, method="average")
agnes.complete <- agnes(x=dm.mat.mds, diss=TRUE, method="complete")
fviz_dend(agnes.avg, cex=0.4, k=2) + theme(text = element_text(size=15), axis.text.y = element_text(size=15))
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

fviz_dend(agnes.avg, cex=0.4, k=3) + theme(text = element_text(size=15), axis.text.y = element_text(size=15))

fviz_dend(agnes.complete, cex=0.4, k=2) + theme(text = element_text(size=15), axis.text.y = element_text(size=15))

fviz_dend(agnes.complete, cex=0.4, k=3) + theme(text = element_text(size=15), axis.text.y = element_text(size=15))

fviz_dend(agnes.complete, cex=0.4, k=4) + theme(text = element_text(size=15), axis.text.y = element_text(size=15))

agnes.compl2 <- cutree(agnes.complete, k=2)
agnes.compl3 <- cutree(agnes.complete, k=3)
agnes.compl4 <- cutree(agnes.complete, k=4)

agnes.avg2 <- cutree(agnes.avg, k=2)
agnes.avg3 <- cutree(agnes.avg, k=3)

As we have only 2 attributes, we can use fviz_cluster to plot our clusters.

We can see that the clusters are well separated for every number of clusters.

fviz_silhouette(silhouette(agnes.compl2, dm.mds), xlab="AGNES") + theme(text = element_text(size=15), axis.text.y = element_text(size=15))
##   cluster size ave.sil.width
## 1       1   98          0.38
## 2       2   57          0.30

fviz_silhouette(silhouette(agnes.compl3, dm.mds), xlab="AGNES") + theme(text = element_text(size=15), axis.text.y = element_text(size=15))
##   cluster size ave.sil.width
## 1       1   60          0.56
## 2       2   38          0.41
## 3       3   57          0.17

fviz_silhouette(silhouette(agnes.compl4, dm.mds), xlab="AGNES") + theme(text = element_text(size=15), axis.text.y = element_text(size=15))
##   cluster size ave.sil.width
## 1       1   60          0.51
## 2       2   38          0.35
## 3       3   34          0.18
## 4       4   23          0.43

fviz_silhouette(silhouette(agnes.avg2, dm.mds), xlab="AGNES") + theme(text = element_text(size=15), axis.text.y = element_text(size=15))
##   cluster size ave.sil.width
## 1       1  139          0.24
## 2       2   16          0.64

fviz_silhouette(silhouette(agnes.avg3, dm.mds), xlab="AGNES") + theme(text = element_text(size=15), axis.text.y = element_text(size=15))
##   cluster size ave.sil.width
## 1       1   76          0.42
## 2       2   63          0.21
## 3       3   16          0.61

Now, let us compare our clusters with the original classes for 2 clusters.

rand.index(df.class.num, agnes.compl2)
## [1] 0.6083787
rand.index(df.class.num, agnes.avg2)
## [1] 0.7020528

3.5 DIANA (I)

diana.all <- diana(x = dm.mat.mds, diss = TRUE)
fviz_dend(diana.all, cex=0.4, k=2) + theme(text = element_text(size=15), axis.text.y = element_text(size=15))

fviz_dend(diana.all, cex=0.4, k=3) + theme(text = element_text(size=15), axis.text.y = element_text(size=15))

fviz_dend(diana.all, cex=0.4, k=4) + theme(text = element_text(size=15), axis.text.y = element_text(size=15))

diana2 <- cutree(diana.all, k=2)
diana3 <- cutree(diana.all, k=3)
diana4 <- cutree(diana.all, k=4)

fviz_silhouette(silhouette(diana2, dm.mds), xlab="DIANA") + theme(text = element_text(size=15), axis.text.y = element_text(size=15))
##   cluster size ave.sil.width
## 1       1  101          0.43
## 2       2   54          0.31

fviz_silhouette(silhouette(diana3, dm.mds), xlab="DIANA") + theme(text = element_text(size=15), axis.text.y = element_text(size=15))
##   cluster size ave.sil.width
## 1       1  101          0.32
## 2       2   32          0.42
## 3       3   22          0.47

fviz_silhouette(silhouette(diana4, dm.mds), xlab="DIANA") + theme(text = element_text(size=15), axis.text.y = element_text(size=15))
##   cluster size ave.sil.width
## 1       1   68          0.38
## 2       2   33          0.35
## 3       3   32          0.33
## 4       4   22          0.43

compareMatchedClasses(df.class.num, diana2, method="exact")$diag[1,1]
## [1] 0.7935484
rand.index(df.class.num, diana2)
## [1] 0.6702137

3.6 Fuzzy C-means (M)

Since it is a partitioning cluster method, we first have to select the number of clusters.

So, the optimal number of clusters for fuzzy clustering:

* Elbow method: it is hard to determine the optimal number of clusters, because there is no clear bend in the curve, but a possible choice is 3;
* Silhouette method: the optimal number of clusters is 3;
* Gap statistic method: it suggests 1 cluster, although 3 clusters also looks reasonable.

Just looking at these statistics, we can use 2 or 3 clusters in Fuzzy analysis.
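Fuzzy clustering via `cluster::fanny` returns a membership matrix rather than hard labels; each row sums to 1, and the nearest crisp partition is presumably what the silhouette tables below are based on. A sketch on synthetic data:

```r
library(cluster)
set.seed(9)

x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))

# memb.exp controls fuzziness (2 is the usual default; closer to 1 = crisper).
fit <- fanny(x, k = 2, memb.exp = 2)

head(fit$membership)   # soft assignments, rows sum to 1
table(fit$clustering)  # nearest crisp partition
```

Points sitting between clusters get memberships near 0.5/0.5, which is exactly what drives their low silhouette widths in the crisp view.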

As we have only 2 attributes, we can use fviz_cluster to plot our clusters.

We can see that the clusters are well separated for every number of clusters.

The last part of the analysis is to validate the clusters found.

##   cluster size ave.sil.width
## 1       1   82          0.51
## 2       2   73          0.20

##   cluster size ave.sil.width
## 1       1   64          0.54
## 2       2   44          0.35
## 3       3   47          0.25

We can see that the cluster sizes are similar. The silhouette widths, however, differ strongly: for 2 clusters one cluster reaches 0.51 while the other is only 0.20, giving an overall average silhouette width of about 0.36 (about 0.40 for 3 clusters).

Now, let us compare our clusters with the original classes.

compareMatchedClasses(df.class.num, fanny2$cluster, method="exact")$diag[1,1]
## [1] 0.6967742
rand.index(df.class.num, fanny2$cluster)
## [1] 0.5746963

4 Conclusions (I)

5 Further research suggestions (I)